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Abstract 

Let us assume that / is a continuous function defined on the unit ball of M. d , of 
the form fix) = g(Ax), where A is a k x d matrix and g is a function of k variables 
for k <C d. We are given a budget m £ N of possible point evaluations f(xi), 
i = 1, . . . , m, of /, which we arc allowed to query in order to construct a uniform 
approximating function. Under certain smoothness and variation assumptions on 
the function g, and an arbitrary choice of the matrix A, we present in this paper 

1. a sampling choice of the points {xi} drawn at random for each function 
approximation; 

2. algorithms (Algorithm 1 and Algorithm 2) for computing the approximating 
function, whose complexity is at most polynomial in the dimension d and in the 
number m of points. 

Due to the arbitrariness of A, the choice of the sampling points will be according 
to suitable random distributions and our results hold with overwhelming probabil- 
ity. Our approach uses tools taken from the compressed sensing framework, recent 
Chernoff bounds for sums of positivc-scmidefinite matrices, and classical stability 
bounds for invariant subspaccs of singular value decompositions. 
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1 Introduction 



1.1 Learning high dimensional functions from few samples 

In large scale data analysis and learning, several real-life problems can be formulated 
as capturing or approximating a function denned on O C M d with dimension d very 
large, from relatively few given samples or queries. The usual assumption on the class 
of functions to be recovered is smoothness. The more regular a function is, the more 
accurately and the more efficiently it can be numerically approximated. However, in 
the field of information based complexity it has been clarified that such a problem is in 
general intractable, i.e., it does not have polynomial complexity. To clarify this poor 
approximation phenomenon, assume 

T d : = {/ : [0, l] d -»• R, p Q /|U < 1, a € Nq}, 

to be the class of smooth functions we would like to approximate. We define the 
sampling operator S n = <j>o N, where N : Td — > W n is a suitable measurement operator 
and : M n — > L oo ([0, l] d ) a recovery map. For example N can take n samples /(xj), 
i = 1, . . . ,n of / and <fi can be a suitable interpolation operator. The approximation 
error provided by such a sampling operator is given by 

e(S n ) := sup ||/ - Sn(/)||oo- 

With this notion we further define the approximation numbers 

e(n,d) := inf e(S n ), 

indicating the performance of the best sampling method, and 

n(e,d) := inf{n : e(n, d) < s}, (1) 

which is the minimal number of samples we need for the best sampling method to 
achieve a uniform accuracy e £ (0, 1). 

1.2 Intractability results 

Recent results by Novak and Wozniakowski |24] state that for a uniform approximation 
over Td we have e(n, d) = 1 for all n < 2^/ 2 J - 1 or n(e, d) > 2^/ 2 J for all e G (0, 1). 
Hence, the number of samples to approximate even a C°° -function grows exponentially 
with the dimension d. This result seems to obliterate any hope for an efficient solution 
of the learning problem in high dimension, and this phenomenon is sometimes referred 
to as the curse of dimensionality. 

Nevertheless, very often the high dimensional functions which we can expect as so- 
lutions to real-life problems exhibit more structure and eventually are much better 
behaved with respect to the approximation problem. There are several models cur- 
rently appearing in the literature for which the approximation problem is tractable, i.e., 
the approximation error does not grow exponentially with respect to the dimension d. 
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According to the behavior of the information complexity n(e,d), cf. ([I]), for small 
e > and large d S N, one speaks about 

• polynomial tractability: if n(e, d) depends polynomially on e~ l and d, 

• strong polynomial tractability: if n(e,d) depends polynomially only on 

• weak tractability: if lim — - ' — - = 0. 

e- 1 +d^oo E 1 + d 

We point to [23\ Chapters 1 and 2] for further notions of tractability and many refer- 
ences. 

In the next two subsections we will recount a few relevant approaches leading in 
some cases to (some sort of) tractability. 

1.3 Functions of few variables 

A function / : [0, l] d — >• R of d variables (d large) may be a sum of functions, which 
only depend on k variables (k small): 

m 

f(xi,...,x d ) = ^2g i (x il ,...,x ik ). (2) 

e=i 

In optimization such functions are called partially separable. This model arises for 
instance in physics, when we consider problems involving interaction potentials, such 
as the Coulomb potential in electronic structure computations, or in social and eco- 
nomical models describing multiagent dynamics. Once k is fixed and d — > oo, the 
learning problem of such functions is tractable, even if the ge are not very smooth. We 
specifically refer to the recent work of De Vore, Petrova, and Wojtaszczyk [13] which 
describes an adaptive method for the recovery of high dimensional functions in this 
class, for m = 1. 

This model can be extended to functions which are only approximatively depending 
on few variables, by considering the unit ball T-Ld,^ of the weighted Sobolev space of 
functions / : [0, l] d ->■ R with 



2 . 

[d] 



f(x)\ dx<l, (3) 



where [d] := {1, . . . , d}, and 7 := {jd,u} are non-negative weights; the definition := 
and the choice of jd,u = leads us again to the model ([2]). A study of the tractability 
of this class, for various weights, can be found in 



1.4 Functions of one linear parameter in high dimensions 

One of the weaknesses of the model classes introduced above is that they are very 
coordinate biased. It would be desirable to have results for a class of basis changes 
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which would make the model basis-independent. A general model assumes that, 

f(x) = g(Ax), (4) 

for A an arbitrary k x d matrix. While solution to this unconstrained problems have 
so far been elusive, the special case of 

f(x) = g(a ■ x), (5) 

where a is a stochastic vector, i.e., a = (a±, . . . , a^), aj > 0, Yl'j=i a j = 1j an d 
g : [0, 1] — > ]R is a C s function for s > 1 has been fully addressed with an optimal 
recovery method in 

The aim of this work is to find an appropriate formulation of the general model 
@, which generalizes both the model of k active coordinates as well as the model of 
one stochastic vector, and to analyze the tractability of the corresponding approxima- 
tion problem. The rest of the paper is organized as follows. After introducing some 
basic notations, the next section is dedicated to the motivation and discussion of the 
generalized model. As an introduction to our formulation and solution approach, we 
then proceed to analyze the simple case of one active direction in Section O under 
milder assumptions on the vector a = (a±, . . . , a^), before finally addressing the fully 
generalized problem in Section HJ The last section is dedicated to the discussion of 
further extensions of our approach, to be addressed in successive papers. 

1.5 Notations 

In the following we will deal exclusively with real matrices and we denote the space 
ofnxm real matrices by M nxm . The entries of a matrix X are denoted by lower 
case letters and the corresponding indices, i.e., X^ = Xij. The transposed matrix 
X T E M mxn of a matrix X E M nxm is the matrix with entries xfj = Xj{. For 
X G M n 

xm we can write its (reduced) singular value decomposition [19j as 

X = UY,V T 

with U S M nxp , V € M mxp , p < min(n,m), matrices with orthonormal columns and 
E = diag(cri, . . . , dp) G M pxp a diagonal matrix where o\ > 02 > • • • > a p > are the 
singular values. For specific matrices X we write the singular value decomposition 

X = U(X)Z(X)V(X) T = Ux^xVl- 

For symmetric, positive semidefinite matrices, i.e., X = X T and v T Xv > for all 
vectors v, we can take V = U and the singular value decomposition is equivalent to 
the eigenvalue decomposition. Note also that o~i{X) = y Xi(X T X), where Xi(X T X) is 
the i th largest eigenvalue of the matrix X T X (actually, this holds for n > m, whereas 
we may want to consider XX T instead of X T X if m > n). The rank of X G M nxm 
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denoted by rank(X) is the number of nonzero singular values. We define the Frobenius 
norm of a matrix X as 

1/2 



1*11, == £ 



It is also convenient to introduce the vector norms 



l/p 



\x\\m :- 



a=l 



We denote by I n G M nxn the identity matrix. The symbol i?Kn stands for the unit 
ball and B^n(r) for the ball of radius r > in R n . The unit sphere in R n is denoted 
by S n . Finally, C n indicates the Lebesgue measure in R n . 



2 The General Model f{x) = g(Ax) and Its Simplifications 

The first approach one may be tempted to consider to a generalization of ([5]) is to ask 
that / : [0, l] d — > R is of the form /(a:) = g(Ax), where A is a k x d stochastic matrix 
with orthonormal rows, i.e., > 0, Sj=i a «j = 1 f° r all i = 1, . . . , &;, ^4j4 t = Ik, and 
5 : -A([0, l] d ) — > R is a C s function for s > 1. There are however two main problems 
with this formulation. The conditions of stochasticity and orthonormality of the rows 
of A together are very restrictive - the only matrices satisfying both of them are those 
having only one non-negative entry per column - and the domain of g cannot be chosen 
generically as [0, l] k but depends on A, i.e., it is the k-dimensional polytope A([0, l] d )- 
Thus we will at first return to the unconstrained model in Q and give up the conditions 
of stochasticity and orthonormality. This introduces rotational invariance for the rows 
of A and the quadrant defined by [0, l] d is no longer set apart as search space. In 
consequence and to avoid the complications arising with the polytope A([0, l] d ) we 
will therefore focus on functions defined on the Euclidean ball. 

To be precise, we consider functions / : B^d(l + e) — > R of the form Q, where A is an 
arbitrary k x d matrix whose rows are in £ d , for some < q < 1, 

(^EkA <Ci. 

Further, we assume, that the function g is defined on the image of £? R d(l + e) under the 
matrix A and is twice continuously differentiable on this domain, i.e., g G C 2 (AB R d(l + 
e)), and 

max ||Z) ai o|| co < Ci- 

For /igd-i the uniform surface measure on the sphere S rf_1 we define the matrix 

Hf := I Vf{x)Vf{x) T d Hd -i{x). (6) 
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From the identity V/(x) = A T \7g(Ax) we get that 

Hf = A T ■ [ Vg{Ax)Vg{Ax) T d^ d -. (x) ■ A, (7) 

and therefore that the rank of is k or less. We will require to be well condi- 
tioned, i.e., that its singular values satisfy a\{H^) > • • • > ak{H^) > a > 0. 
The parameters in our model are the dimension d (large), the linear parameter dimen- 
sion k (small), the nonnegative constants C\, C2, < q < 1, and < a < kC 2 . 
We now show that such a model can be simplified as follows. First of all we see that 
giving up the orthonormality condition on the rows of A was actually unnecessary. Let 
us consider the singular value decomposition of A = UY,V T , hence we rewrite 

f(x) = g(Ax) = ~g(Ax), AA T = I k , 

where g(y) = g(lTEy) and A = V T . In particular, by simple direct computations, 

• SU P|«|<2 ll-^fflloo < sup| a |< 2 ||-D Q 5lloo ■ m.ayi{Vkai (A) , kai (A) 2 } , and 

• (Ei=iia«l , ') 1/9 <^(A)- 1 fc 1 /ff- 1 /a. 

Hence, by possibly considering different constants C\ = k 1 l q ~ l l 2 ak{A)~ l Ci and C2 = 
max we can always assume that AA T = 1^, meaning A is row- 

orthonormal. Note that for a row-orthonormal matrix A, equation ([7]) tells us that the 
singular values of are the same as those of H g , where 

H g := [ Vg(Ax)Vg(Axfdfi §d -i(x). 

The following simple result states that our model is almost well-defined. As we will 
see later, the conditions on A and / will be sufficient for the unique identification of / 
by approximation up to any accuracy, but not necessarily for the unique identification 
of A and g. 

Lemma 2.1. Assume that f(x) = g(Ax) = g(Ax) with A, A two k x d matrices such 
that AA T = Ik = AA T and that H* has rank k. Then A = OA for some k x k 
orthonormal matrix O. 

Proof. Because A and A are row-orthonormal the singular values of H g and Hg are the 
same as those of H?, i.e., we have H g = UT,U T and Hg = UT,U T , where S is a k X k 
diagonal matrix containing the singular values of in nonincreasing order and U, U 
are orthonormal k x k matrices. Inserting this into ([7]) we get 

H f = A T H g A = A T UY>U T A 
= A 1 HgA = A T UY>U T A. 

U T A and U T A are both row-orthonormal, so we have two singular value decompositions 
of H*. Because the singular vectors are unique up to an orthonormal transform, we 
have U T A = VU T A for some orthonormal matrix V or A = OA for O = UVU T , 
which is by construction orthonormal. □ 



With the above observations in mind, let us now restate the problem we are address- 
ing and summarize our requirements. We restrict the learning problem to functions 
/ : B Rd (l + e) R of the form f(x) = g(Ax), where A G M kxd and AA T = I k . 
As we are interested in recovering / from a small number of samples, the accuracy 
will depend on the smoothness of g. In order to get simple convergence estimates, we 
require g G C 2 (B R k(l + e)). These choices determine two positive constants C±, C2 for 
which 

Em 9 ] ^> ( § ) 

and 

sup police < C 2 . (9) 

|a|<2 

For the problem to be well-conditioned we need that the matrix H* is positive definite 

a 1 (Hf)>--->a k (Hf)>a, (10) 
for a fixed constant a > (actually later we may simply choose a = a k {H^)). 

Remark 1. Let us shortly comment on condition (|10p in the most simple case k = 1, 
by showing that such a condition is actually necessary in order to formulate a tractable 
algorithm for the uniform approximation of f from point evaluations. 
The optimal choice of a is given by 

cf. Theorem \3.7\ Furthermore, we consider the function g G C 2 ([— 1 — e, 1 + e]) given by 
g(y) = 8(y — 1/2) 3 for y G [1/2, 1 + e] and zero otherwise. Notice that, for every a G M. d 
with \\a\\pd = 1, the function f(x) = g(a • x) vanishes everywhere on S d_1 outside of 

the cap U(a,l/2) := {x G : a • x > 1/2}, see Figured The fi§d-i measure of 

U(a, 1/2) obviously does not depend on a and is known to be exponentially small in d 
f2l\l . see also Section \3.3[ Furthermore, it is known, that there is a constant c > 
and unit vectors a 1 , . . . , a K , such that the sets U(a 1 ,1/2), . . . ,U(a K , 1/2) are mutually 
disjoint and K > e cd . Finally, we observe that max^ggd-i |/(x)| = f(a) = g(l) = 1. 

We conclude that any algorithm making only use of the structure of f(x) = g{a-x) 
and the condition ([9]) needs to use exponentially many sampling points in order to 
distinguish between f(x) = and f(x) = g(a t ■ x) for some of the a 1 's as constructed 
above. Hence, some additional conditions like ([8]) and (|10[) are actually necessary to 
avoid the curse of dimensionality and to achieve at least some sort of tractability. Let 
us observe that a = a(d) decays exponentially with d for the function g considered 
above. We shall further discuss the role of a in Section VJ.'Ji 

Contrary to the approach in [TTJ our strategy used to learn functions of the type ([!]) 
is to first find an approximation A to A. Once this is known, we will give a pointwise 
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Figure 1: The function g and the spherical cap U(a, 1/2). 

definition of the function g on B^k (1) such that f{x) := g(Ax) is a good approximation 
to / on B^a(l). This will be in a way such that the evaluation of g at one point will 
require only one function evaluation of /. Consequently, an approximation of g on its 
domain B R k(l) using standard techniques, like sampling on a regular grid and spline- 
type approximations, will require a number of function evaluations of / depending only 
on the desired accuracy and k, but not on d. We will therefore restrict our analysis to 
the problem of finding A, defining g, and the amount of queries necessary to do that. 

3 The One Dimensional Case k = 1 

For the sake of an easy introduction, we start by addressing our recovery method again 
in the simplest case of a ridge function 

f(x)=g(a-x), (12) 

where a = (a±, . . . , a^) G K rf is a row vector, ||a||^d = 1, and g is a function from the 
image of B R d(l + e) under a to R, i.e., g : -Br(1 + e) — > M. 

The ridge function terminology was introduced in the 1970's by Logan and Shepp [22] in 
connection with the mathematics of computer tomography. However these functions 
have been considered for some time, but under the name of plane waves. See, for 
example, [121 [20]. Ridge functions and ridge function approximation are studied in 
statistics. There they often go under the name of projection pursuit. Projection 
pursuit algorithms approximate a function of d variables by functions of the form 

l 

f{x) » ^2gj(aj ■ x). (13) 
i=i 

Hence the recovery of / in (|12|) from few samples can be seen as an instance of the 
projection pursuit problem. For a survey on some approximation-theoretic questions 
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concerning ridge functions and their connections to neural networks, see [27] and refer- 
ences therein, and the work of Candes and Donoho on ridgelet approximation [3E1E]. 
For further clarity of notations, in the following we will assume a to be a row vector, 
i.e., alxd matrix, while other vectors, x, £, <p . . . , are always assumed to be column 
vectors. Hence the symbol a ■ x stands for the product of the 1 x d matrix a with the 
d x 1 vector x. 



3.1 The Algorithm 

As in [11] a basic ingredient of the algorithm is a version of Taylor's theorem giving 
access to the vector a. For £ G B R d, <p £ B R d(r), e, r S R+, with re < e, we have, by 
Taylor expansion, the identity 

= /({ + ^- /(0 -^VV(CM, (14) 

for a suitable ((£,,p) £ ^M d (l + <0- Thanks to our assumptions ([8|) and ([9]), the term 
[(/3 T V 2 /(C)(/9] is uniformly bounded as soon as ip is bounded. We will consider the 
above equality for several directions (fi and at several sampling points £j. 
To be more precise we define two sets X , $ of points. The first 

X = {ij eS" :j = l,...,m x }, (15) 

contains the mx sampling points and is drawn at random in § rf_1 according to the 
probability measure (i§d-i. For the second, containing the m$ derivative directions, 
we have 

, „ , /- , . , 1 f 1, with probability 1/2, 

$ = <^ <fi £ B Rd NdJm^) :ipn= = < ' . iU , , 1 ' 

yri k\ /v *y y-« \ -1, with probability 1/2, 

i = 1, . . . , m$, and £ = 1, . . . , d} . (16) 

Actually we identify $ with the m<j> x ci matrix whose rows are the vectors ipi. To write 
the mx x m$ instances of (j!4j) in a concise way we collect the directional derivatives 
g'(a ■ £j)a, j = 1, . . . , as columns in the d x mx matrix X, i.e., 

A = ( 5 '(a-e 1 )a T ,..., 5 , (a-£ m; ,)a T ), (17) 

and we define the m$ x mx matrices Y and £ entrywise by 

mj _ /te+^.)-/te) , m 



and 
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We denote by yj the columns of Y and by e,- the columns of £, j = 1, . . . , mx- With 
these matrices we can write the following factorization 

$X = Y - £. (20) 

The algorithm we propose to approximate the vector a is now based on the fact that the 
matrix X has a very special structure, i.e., X = a T Q T ' , where Q = (g'(a ■ £1), . . . ,g'(a ■ 
£,m x )) T ■ I n other words every column Xj is db scaled copy of the vector a T and com- 
pressible if a is compressible. We define a vector a compressible informally by saying 
that it can be well approximated in ^ p -norm by a sparse vector. Actually, any vector a 
with small ^q-norm can be approximated in £ p by its best K-term approximation a^j 
according to the following well-known estimate 

o- K (x) e d := \\a - a [K] \\ £dp < \\a\\ e dK 1/p - 1/q , p > q. (21) 

Thus by changing view point to get 

Y = <5>X + E 

we see that due to the random construction of we actually have a compressed sens- 
ing problem and known theory tells us that we can recover a stable approximation 
xj to Xj via ^i-minimization (see Theorem 13.21 for the precise statement). To get an 
approximation of a we then simply have to set a = Xj/\\xj\\£d for j such that ||%|Ld is 
maximal. Prom these informal ideas we derive the following algorithm. 



Algorithm 1: 






• Given m$ , mx , draw at random the 


sets $ and X as in (j 1 5 j) 


and (|16p. and 


construct Y according to (|18|) . 






• Set Xj = A(yj) := argmin^^^ \\z\\ £ d 






• Find 






Jo = arg 

3= 


max \\xj \\»d. 

-l,...,m x Jut 2 


(22) 


• Seta = x jo /\\x jo \\ e d. 






• Define g(y) := f{a T y) and f(x) := g 


[a-x). 





The quality of the final approximation clearly depends on the error between Xj and 
Xj, which can be controlled through the number of compressed sensing measurements 
m$, and the size of a ~ maxj 1 1 1 1 = maxj \gf(a • £j)\, which is related to the number 
of random samples mx- If (jlip is satisfied with a large, we shall show in Lemma 13.61 
with help of Hoeffding's inequality that also maxj = maxj \g'(a ■ £j)\ is large 

with high probability. If the value of a is unknown and small, the values of 11%!^ 
produced by Algorithm 1 could be small as well and, as discussed after the formula 
(fTT|h no reliable and tractable approximation procedure is possible. 

To be exact we will in the next section prove the following approximation result. 
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Theorem 3.1. Let < s < 1 and logd < m<j> < [log6]~ 2 d. Then there is a constant d x 
such that using mx • (m$ + l) function evaluations of f , Algorithm 1 defines a function 
f : B^d(l + e) — > M. that, with probability 



2 2 ' 
%X s a 



will satisfy 



where 



e -c[ mi , + e -V^d + 2e c* ) (23) 



11/ - /Hoc < 2C 2 (1 + 6) , (24) 



z^i = C" 




^ _log(d/m$) 
and C" depends only on C\ and C% from ([8]) and 



(25) 



Remark 2. L PFe s/ia/Z /12; v\ as defined by (|25p /or t/ie rest o/ i/iis section. Fur- 
thermore, we suppose that the selected parameters (s,e and m$ ) are such that v\ < 
y/ a(l — s) holds. See Remark^ (ii) for knowing how we can circumvent in practice 
the case that this condition may not hold, clearly invalidating the approximation (1241) . 

2. In order to show a concrete application of the previous result, let us consider, 
for simplicity, a class of uniformly smooth functions g such that |</(0)| / 0; hence, by 
Proposition l3.<j?L a = a(g) > is independent of the dimension d. If additionally we 



a 



choose q = 1, m<j> < d, and e > such that m$(e + y/log(d/m$))~ 2 = 0(8' 
5 > 0, for 5, a — > and mx = 0(a~ 2 ) for a — > 0, then, according to Theorem \3.1l we 
obtain the uniform error estimate 

\\f-f\\ 00 = 0(5), 5^0, 

with high probability. Notice that, if l/log(d) > 5 > 0, then the number of evaluation 
points mx • (m$ + 1) = 0((5 • a)~ 3 ) ; for 5, a — > 0, is actually independent of the 
dimension d. 



3.2 The Analysis 

We will first show that Xj is a good approximation to Xj for all j. This follows by 
the results from the framework of compressed sensing [21 El Ell [111 HE! H2J CEU • hi 
particular, we state the following useful result which is a specialization of Theorem 1.2 
from [36] , to the case of Bernoulli matrices. 

Theorem 3.2. Assume that 3> is an m x d random matrix with all entries being 
independent Bernoulli variables scaled with 1/y/m, see, e.g., (|16p . 

(i) Let < 5 < 1. Then there are two positive constants c\,C2 > 0, such that the 
matrix $ has the Restricted Isometry Property 

(1 - S)\\x\\% < \\Qx\\jm < (1 + 5)\\x\\ 2 fd (26) 

c 2 2 2 
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for all x G K suc/i i/iai ^supp(x) < C2m/ log(d/m) with probability at least 



1 



-cim 



(27) 



fwj Lei us suppose that d > [log6] 2 m. Then there are positive constants C,d x ,d 2 > 
0, such that, with probability at least 



1 - e" c i m - e- Vmd , 



(28) 



the matrix $ has the following property. For every x G M. d , e G M m and every natural 
number K < c' 2 m/ \og{d/m) we have 



|| A($x + e) - x|U <c(k 1/2 a K (x) id + max{||e|U, 0og d||e||^}) , 



(29) 



o~k{x)i<i := inf{||2; — z||^ : #suppz < iT} 
zs t/ie best ET-term approximation of x. 

Remark 3. (%) T/ie /irs£ pari o/ Theorem \3.2\ is well known, see, e.g., J3J/ or \lb\ Page 
15] and references therein. 

(ii) The second part of Theorem \3.2\ is relatively new. It follows from Theorem 2.3 of 
I36f combined with Theorem 3. 5 of U3f . and the first part of Theorem \3.2i Without 
the explicit bound of the probability (|28p . it appears also as Theorem 1.2 in 136)/ . 



Applied to the situation at hand we immediately derive the following corollary. 
Corollary 3.3. (i) Let d > [log 6] 2 m<j>. Then with probability at least 

I _ (g-ei m * _|_ e -\/m^d\ 

all the vectors Xj = A(yj), j = 1, . . . ,mx calculated in Algorithm 1 satisfy 

1 1/2-1/? 



L j\\e$ 



< C 



log(d/m$) 



+ maxl e 



3 1 1 ' 



i/logd||ej|Lm*} 



(30) 



where C depends only on C\ and C2 from ([8]) and ([9]) . 

(ii) If furthermore m<j> > logci holds, then with the same probability also 



\Xj Xj\\gd 



1 1/2-i/g 



log(d/m$) 



+ 



(31) 



where C depends again only on C\ and C2 from (JSj) and 
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Proof. We apply Theorem l3.2l to the equation yj = Qxj+Ej and K < c' 2 m$/ log(d/m$). 
To do so, we have to estimate the best K-term approximation error of OK{xj)i<i and 
the size of the errors Ej. We start by bounding oxixj)^. Recall that due to the 
construction of X every column is a scaled copy of the vector a T , i.e., xj = g' (a ■ £j)a T , 
so we have by (f2~Tj) 

r -i l/2-l/qr 

K-^a K ( Xjk < \ g '(a ■ • Hall,. ■ K^* < C x C 2 [^^y 

This finishes the proof of the first part. 

To prove the second part, we estimate the size of the errors using (fT9 



- • max fa V f{C,ij)<-Pi 

Z 1=1, ...,m$ 



max 



2m<j) i=i,...,m$ 
/ d 



^ a.kaig"{a ■ Qj] 



< 



< 



e||ff"||oo 
\/in$\\ej 



k,l=l 
x 2 



E 

vfc=l 



Ofcl 



< 



2m$ 



2/q 



E 

\.k=l 



< 



CiC , 2 | 
2m$ 



00 2 A /TO<J> 



leading to 



maxj £,■ ,m # 



2 A /m$ 



, \/log d||ej|^m # } < 



C\C2 



e • max < 1, 



log d 



(32) 



(33) 



(34) 



Together with our assumption m$ > log d this finishes the proof. 



□ 



Next we need a technical lemma to relate the error between the normalized version of 
Xj and a to the size of ||%||^d- 



Lemma 3.4 (Stability of subspaces - one dimensional case). Let us fix x G Mr, a G 

S d_1 , 7^ 7 £ R, and n G M d wrai/t norm ||n||^* < f i < |7|. // we assume x = 7a + n 



sign 7- 



a 



< 



2i/i 

l^lle* 



(35) 



Proof. Applying the triangular inequality and its reverse form several times and using 
that a G S^" 1 we get 



x 



sign 7- 



\ x \\e* 



< 



sign 7- 



\j\a 



+ 



h\ a 



\ x \\t* 



< TpriT — + 

\\ x \\4 



< 



2vi 
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□ 



Applied to our situation where otj = g'{a- £,j)a T + nj we see that the bound in (|35[) 
is best for maximal which justifies our definition of a in Algorithm 1. 

As a last ingredient for the proof of Theorem 13.11 we need a lower bound for 
maxj = i v .. jm;t , Hxjl^d. Since we have maxj ||xj||^d > max,,- \g'(a-£j)\ — max,,- \\xj — Xj\\ e d > 
maxj \g'{a ■ £j)| — v\ we just have to show that, with high probability, our random 
sampling of the gradient via the £j provided a good maximum. To do this we will use 
Hoeffding's inequality, which we recall below for reader's convenience. 

Proposition 3.5 (Hoeffding's inequality). Let X\, . . . ,X m be independent random 
variables. Assume that the Xj are almost surely bounded, i.e., there exist finite scalars 
aj , bj such that 

F{Xj -EXj £ [aj,bj]} = 1, 
for j = 1, . . . , m. Then we have 



>t \ <2e ^T=i^-^ 2 



Let us now apply Hoeffding's inequality to the random variables Xj = \g'(a • Cj)\ 2 - 



Lemma 3.6. Let us fix < s < 1. Then with probability 1 — 2e 



we have 



max \g'(a ■ £,)| > a(l - s), 

j=l,...,m x 



where a := E^(\g'(a ■ Cj)\ 2 )- 

Proof. By our assumptions (|10j) and (|9|) we have 



EX 3 = E^g'ia ■ £j)\ 2 ) = [ \g'(a ■ fifd/j^i® > a > 0, 



and 



Xj-EXj e \-a,Cl - a). 
Hence, by Hoeffding's inequality we have 



£i*v&)i s 



Using (f36j) we immediately obtain 



mxOL 



> smxct > < 2e 



-73- 

( 2 



(36) 



-^\g'(a-^)\ 2 >a(l-s), 



(37) 
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2 2 
<zpg s a 



with probability 1 — 2e c 2 .If \g'(a ■ £,j)\ 2 < a(l — s) for all j = 1, . . . , mx then 



(1371) would be violated. Hence for the maximum we have 



max \g'(a-£j)\ > y a(l — s). 
j=l,...,mx 



□ 



Finally we have all the tools ready to prove Theorem 13.1 

Proof of Theorem I3.lt 

Proof. Lemma 13.61 ensures that 

\g'(a-£ jo )\ > y/a(l - s) 



with probability 1 — 2e c 2 . Therefore, Corollary 13.31 together with Lemma EL 
show that with probability at least 



2 2 ' 



I _ ^ e - c i m <i> _|_ e -Vm<s>d _|_ 2 e 
a as defined in Algorithm 1 satisfies 



|sign(c/(a • £ jo ))a - a\\ £d < j== = — — (38) 



2^i 

£ 2 ~ ^(i - «) - v\ 

for the unknown sign of g'(a ■ £j ). 

Using this estimate we can prove that / as defined in Algorithm 1 is a good approxi- 
mation to /. For x S B R d(l + e) we have, 

\f(x)-f(x)\ = \g(a ■ x) - g(a ■ x)\ 

= \g{a ■ x) - f {a T ■ a ■ x)\ 

= \g(a ■ x) — g(a ■ a T ■ a ■ x)\ 

< C2\a • x — a ■ [a T a] ■ x\ 

= C2\a ■ (Id — a T a)x\. 

Because a(Id — a T a) = and therefore sign(g'(a • £, ))a(Id — a T a) = 0, we can further 
estimate 

\f(x)-f(x)\ < C 2 \a-(I d -bfa)x\ 

= C 2 |(a - sign<y(a • £j ))a) ■ (I d - a T a)x\ 

< C 2 \\a - sign^a • £j ))a||^ ■ \\x 

< 2C 2 (l + e 



"2 ""2 



y/a(l - s) - V\ 

□ 
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Remark 4. We collect here a few comments about this result. 

(i) Our recovery method differs from the one proposed by Cohen, Daubechies, DeVore, 
Kerkyacharian, Picard illf . In their approach, the domain is taken to be [0, 1]^ and 
they make heavy use of the additional assumption Ylj=i a j = 1 an< ^ a j — 0- This allows 
them to derive an almost completely deterministic and adaptive strategy for sampling 
the function f in order to find first an approximation to g and only then addressing the 
approximation to a. Here we follow somehow the opposite order, first approximating a 
and then finding a uniform approximation to g and, eventually, to f as well. Notice 
further that not having at disposal additional information on a, which is fully arbitrary 
in our case, we need to use a random sampling scheme which eventually gives a result 
holding with high probability. 

(ii) Note that Theorem \ 3.1\ gives an a priori estimate of the success probability and 
approximation error of Algorithm 1. If the problem parameters q,C\,C2, and a are 
known, they can be used to choose m$ and mx big enough to have, say, a prescribed 
desired accuracy 5 with probability at least 1 — p. 

However once Algorithm 1 has been run we have the following a posteriori estimate. 
With probability at least 1 — ( e - c 'i m * + e -vm*d) we /j aue 

||/-/||oo<C 2 (l + e) ir ^_. 

\\ x jo \\£$ 

Hence, the ratio ,, 2y \ — <C 1 defines an a posteriori indicator that the number of sam- 

pies mx and m$ has been properly calibrated, otherwise just more points will be drawn 
until such a condition is obtained. 

(Hi) The parameter e is chosen at the very beginning in the Taylor expansion (|14p 
and, from a purely theoretical point of view, could be chosen arbitrarily small. Unfor- 
tunately, this may affect the numerical stability in the approximation in (|14p of the 
derivative by means of a finite difference. Hence, the parameter e should not be 

taken too small in practice. Up to some extent this may be compensated by choosing a 
larger number of points m$ in (|25|) , as in our expression for v\ in (|25|) e appears in 
a ratio of the form . We return in more detail to this point later in Section \5.1[ 
In recent numerical experiments associated to the work \31^ . we have been experiencing 
very stable reconstructions with reasonable choices, e.g., e ~ 0.1. Hence we do not 
consider this issue of any practical relevance or difficulty. 

3.3 Discussion on tractability 

The approximation performances of our learning strategy are basically determined by 
the optimal value of a (see, e.g., (fTOj) ). which is achieved by the choice 

a := \g' (a ■ x)\ 2 dfi S d-i(x). (39) 

Due to symmetry reasons this quantity does not depend on the particular choice of a. 
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The rotation invariant probability measure figa-i on S^ 1 is induced on the sphere 
by the (left) Haar measure on the Lie group of all orientation preserving rotations. For 
a given k x d matrix U such that UU T = I k (i.e., with orthonormal rows) we define 
the measure \i k on the unit ball B^k in M fc induced by the projection of /i§d-i via U, 
i.e., for any Borel set B C B-^k we define 

H k (B) := U^-^B) := » S d-i(U^(B)). (40) 

Since /Ugd-i is rotation invariant, fj, k does not depend on the particular matrix U, 
and is itself a rotation invariant measure on B^k. Hence for any summable function 
h : B R k — > R, for any k x k orthogonal matrix O such that OO t = I k = O t O, and for 
any k x d matrix U such that UU T = 4, we have the identities 

/ h(Oy)dfj, k (y) = h(y)dfj, k (y) = h(Ux)dfi § d-i(x). (41) 
JB Rk JB Rk Jsd-i 

The following result is well known. We refer to [301 Section 1.4.4] for the case of 
C n . The proof given there works literally also in the real case. 

Theorem 3.7. Let 1 < k < d be natural numbers. Then the measure fi k defined in 
(|4U|) is given by 

T(d/2) n ., d-2-k 

d ^ y) = ^ m -k)/2) {1 - M ^ 2 

Notice that as d — > oo, and for fixed k, the measure fi k becomes more and more 
concentrated around 0, in the sense that, for e > fixed 

Hk(B R k(e)) -> 1, for d oo, 

very rapidly (typically exponentially). By using the explicit form of the measure ji k 
we can compute 

» k (B Rk (s)) = 1- / (l-\\y\\l 2 )^dy 

7T fc / 2 r((d - fc)/2) J BRk \B R k (e) 



2T(d/2) 



2T(d/2) d~2-k = 2 
~ ~ T(k/2)T{(d-k)/2) e 



By Stirling's approximation Uk/ l^nIl k ^/o\ ~ Xa-i/l^-t-i > tnus for k and e con - 



r(fc/2)r((d-fc)/2) ~ y Trk^^d-k) 
stant 

/j, k (B Rk (e)) -> 1 

exponentially fast as d — > oo. For k = 1, this phenomenon can be summarized in- 
formally by saying that the surface measure of the unit sphere in high dimension is 
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concentrated around the equator [21] . Hence in case d 3> k we may want to take into 
account possible rescaling, i.e., working with spheres of larger radii, in order to even- 
tually consider properties of g (actually the matrix H g ) on larger subsets of see 
also Remark Without loss of generality, by keeping in mind this possible rescaling, 
we can therefore assume to work with the unit sphere. 

For k = 1, we observe, that a as in ([39]) is determined by the interplay between 
the variation properties of g and the measure fj,\. As just mentioned above, the most 
relevant feature of Hi is that it concentrates around zero exponentially fast as d — > oo. 
Hence, the asymptotic behavior of a exclusively depends on the behavior of the function 
g' in a neighborhood of 0. 

To illustrate this phenomenon more precisely, we present the following result. 

Proposition 3.8. Let us fix M EN and assume that g : — > R is C M+2 -differentiable 
in an open neighborhood U of and ^r<?(0) = for £ = 1, . . . , M. Then 

oc{d) = ^^ 2 \ )/2) £ l^(y)l a (l ~ V 2 )^dy = 0(d~ M ), for d -> oo. 
:st o 



Proof. First of all, we compute the t th moment of the measure 7r i/2p(^\w2) 0- 



r(d/2) f 1 yi(1 y2) ^ dy _ [i + (-iflr(d/2)r((i + 1)/2) m 



vr 1 /2r(( ( i- l)/2) y_r v *' u 2^T((d + £)/2) 

Notice that all the odd moments vanish. By Taylor expansion of g' around and by 
taking into account that "fjjSW — for I = 1, . . . , M, we obtain 

to) = £ jihy.h 3 ^ 1 + ° iyM+1) = ^\h^ 9{0)yM + 0{yM+ly 



Hence, 



and 



|2 _ / 1 a „/n\ l „,2M i /-/v„.2M+:h 



r(d/2) 



= "TTJTTTTT TwTvT / |</(y)| 2 (l - 2/ 2 )^ 



7rV2r((d - l)/2) 

r(d/2) 



1 



7rV2r((d-l)/2) ^ 

r(d/2) / / 1 d M+1 



g'(y)\\l - y^dy + / \g'(y)\ 2 (l - y 2 )^dy 

JB m \U 



/ i d + \ f d-3 

+ / Ofe 2, '+ 2 )(l -y 2 )"dy+ [ | 9 '(„)| 2 (1 - y')^dy 
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Notice that we consider the (2M + 2) th moment in the expression above because the 
previous one is odd and therefore vanishes. Now, the term J Br \h \g'(y)\ 2 (l — y 2 )^~dy 
goes to zero exponentially fast for d — > 0. By using ()42j) we immediately obtain 

a{d) = i/9 r(d/2) - - f 1 \g\y)\ 2 {l-y 2 )^dy 

yJ 7T 1 /2r(((i- l)/2) L\ 

/n f T(d/2)T((l + 2M)/2) \ 
= °{ T((d + 2M)/2) J' d ^°°- 

By Stirling's approximation, for which T(z) = (§)* + ^(1 + f° r z ~ * 00 j we 

obtain 

r(d/2)r((l+2M)/2 ) ^ jM _ lV2/1 , r,Tijr\M r j , 
r((d + 2M)/2) 

This eventually yields 



d ( rf -l)/ 2(1 + 2M) M (d + 2M) -(^+M) ) rf . 



OO. 



a (d) = r ^/ 2 ) Z 1 |</(y)| 2 (l - y 2 )^dy = O (d~ M ) , d -> oo. 

w 7T 1 /2 r ((d- 1)/2) 7-1 v ; 



□ 



The number x (m# + 1) of points we need in order to achieve a prescribed 
accuracy in the error estimate (|24p of Theorem 13.11 depends on a. Proposition 13.81 
ensures that, if g'(y) does not vanish for y — > super-polynomially, then the dependence 
of a (therefore of the error estimate and the number mx x (m$ + 1) of points) on d 
is at most polynomial. According to this observation we distinguish three classes of 
ridge functions: 



(1) For < q < 1, C\ > 1 and C 2 > a > 0, we define 

T\ := J-}(a>o, q, C\, C2) := {/ : B K d — > R : 
3a E [lalLd = 1, llalLd < Ci and 

ut 2 q 

3g G C 2 (B R ), \g'(0)\ > a > : f(x) = g(a ■ x) }. 



(2) For a neighborhood U of 0, < q < 1, C\ > 1, C 2 > a > and N > 2, we 
define 

J-J := Tl(U,aLQ,q,C\,C2,N) := {/ : B Rd -)• R : 

3a G M d , ||a|| € d = 1, ||a||^ < d and 3g G C 2 (B t ) nC^W) 

30 < M < N- 1, b (M) (0)| > a > : /(x) = <?(a • x) }. 

(3) For a neighborhood W of 0, < q < 1, Ci > 1 and C2 > ao > 0, we define 

Fl := j|(W,a ,g,Ci,C 2 ):={/:S Rd ^R: 

3a€lR d , ||a||^ = l,||a||^ <Ci and 3<? € C 2 (-Br) n C°°(b() 
\ g ( M )(0)\=0 for all M € N : f(x) = g{a-x)}. 
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Theorem 13.11 and Proposition 13.81 immediately imply the following tractability result 
for these function classes. 

Corollary 3.9. The problem of learning functions f in the classes J-\ and T\ from 
•point evaluations is strongly polynomially tractable and polynomially tractable re- 
spectively. 

On the one hand, let us notice that if in the class we remove the condition 
llalLd < Ci, then the discussion on the functions described in Remark [T] shows that 
the problem actually becomes intractable. On the other hand, we conjecture that the 
restriction imposed by a condition such £LS 1 1 Ob 1 1 £d < C\ should instead give to the prob- 
lem some sort of tractability. Unfortunately, our learning method and approximation 
estimates in Theorem 13.11 do not provide any information about the tractability of the 
problem for functions in the class T\. 

4 The General Case k > 1 

In this section we generalize our approach to the case k > 1, i.e., we consider k-ridge 
functions 

f{x) = g(Ax). (43) 

Obviously, the sum of k ridge functions (as appearing for example in (|13p ) is a /c-ridge 
function and the same holds true also for the product. 

We will proceed as in the one-dimensional case, giving first the basic ideas, which 
motivate the recovery algorithm and then stating and proving our main theorem. 
Remember that we assume, that A is a k x d matrix such that AA T = 1^, and 
g : B Rk (1 + e) — > R is a C 2 function. 

4.1 The Algorithm 

As before we consider a version of Taylor's theorem giving access to the matrix A. For 
£ € B^d, <p € B R d(r), e,r € R+, with re < e, we have the identity 

[y 9 (A0 T A],> = /(g + efP e ) " /(0 -|[ y r V 2 /(C)y], (44) 

for a suitable ((£,,ip) 6 B R d(l + e) and thanks to Q the term [9? T V 2 /(C)y] is again 
uniformly bounded as soon as ip is bounded. 

As in the one-dimensional case we now consider (|44p for the m$ directions in the set $ 
and at the mx sampling points in the set X, where X , $ are defined as in (|15p and (|16p 
respectively. Again we collect the directional derivatives J \7g(A£j) T A, j = 1, . . . , mx 
as columns in the d x mx matrix X, i.e., 

X = (A T Vg(Mi), • • • , A T Vg(AU x )), (45) 
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and using the matrices Y and £ as denned in (|18p and (|19p . we can write the following 
factorization 

<$>X = Y - E. (46) 

Similarly to the one-dimensional case we find that the matrix X has a special 
structure, which we will exploit for the algorithm, i.e., X = A T Q T , where Q = 
(\7 g(A^i) T \ . . . \SJ g{A^ mx ) T ) T . The columns of X are now no longer scaled copies 
of one compressible vector but they are linear combinations of k compressible vectors, 
i.e., the rows of the matrix A. Thus compressed sensing theory again tells us that we 
can stably recover the columns of X from the columns of Y via ^-minimization and 
in consequence get a good approximation X to X. 

Furthermore, since A has rank k, as long as Q T has full rank, also X will have rank k 
and moreover the column span of the right singular vectors of X T = USV T will coin- 
cide with the row span of A, i.e., A T A = VV T . Moreover, V T gives us an alternative 
representation of / as follows, i.e., 

f(x) = g{Ax) = g{AA T Ax) = g{AVV T x) =: ~g(V T x), 

where g(y) := g(AVy) = f(Vy). If X is a good approximation to X, then we can 
expect that the first k right singular vectors of X have almost the same span as those 
of X and thus of A, which inspires the following algorithm. 



Algorithm 2: 






• Given m&,mx, draw at random the sets $ and X 
construct Y according to (|18p. 


as in 


(USD and (HU), and 


• Set Xj = A(yj) := arg min % . =<1 , 2 \\z \\ e p for j 
(xi j • • • > x mx ) . 


= 1,. 


. , mx, and X = 


• Compute the singular value decomposition of 






*-(^)(„H)( 


v 2 T ) 


(47) 


where Si contains the k largest singular values. 






• SetA = Vf. 






• Define g(y) := f{A T y) and f(x) := g{Ax). 







The quality of the final approximation of / by means of / depends on two kinds of 
accuracies: 



1. The error between X and X, which can be controlled through the number of 
compressed sensing measurements m$; 
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2. The stability of the span of V T , simply characterized by how well the singular 
values of X or equivalently Q are separated from 0, which is related to the number 
of random samples mx- 

To be precise, in the next section we will prove the following approximation result. 

Theorem 4.1. Let \ogd < ?n<j> < [log6] 2 d. Then there is a constant d x such that using 
m#-(m$+l) function evaluations of f, Algorithm 2 defines a function f : B K d(l + e) — > 
R that, with probability 



1 — ( e _c 'i m * _|_ e -Vm#<i i j {e ikC. 



(48) 



will satisfy 



where 



11/ -/Hoc <2C 2 Vk(l + e) 



V2 



C k 1/q 



- s) - u 2 ' 

1/2-1/, ek 2 \ 
+ 



(49) 



log(d/m$) 

and C depends only on C\ and C 2 (cf. (JSJ) and Qj. 
4.2 The Analysis 

We will first show that X is a good approximation to X by applying Theorem 13.2 
columnwise. This leads to the following corollary. 

Corollary 4.2. Let logd < m$ < [log6] 2 d. Then with probability 



I _ (g-ci m * _|_ g-V™*^ 

the matrix X as calculated in Algorithm 2 satisfies 

\\X-X\\ F <C^ U'l \ - ™* . 

\ Llog(a/m$)J / 

where C depends only on C\ and C 2 (cf. (|5J) and ©j. 

Proof. The proof works essentially like that of Corollary 13.31 We decompose 



(50) 



IX -X 



3=1 



The best X-term approximation of may be estimated using 



\V=1 \u=l 
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which leads to 



The norms of may be estimated similarly to the proof of Corollary 13.31 as 

^ C\ C 2 k 2 e , ^ C? C 2 A: 2 e 

£i < m * < — - — and £,' ,"■* < — . 

2 2 A /m$ to ° 2m$ 



Putting all these estimates (with the choice K ~ m<j>/log(d/m$)) into Theorem 13.21 
we obtain the result. □ 

Remark 5. The construction Xj = A(yj-) := argmin % . = $ 2 INI^, for j = 1, . . . ,mx, 

and X = (xi, . . . ,x mx ) and Corollary \4-2\ are not the unique possible approach to 
approximate X. As we are expecting X to be a k-rank matrix for k <C min{d, mx}, 
one might want to consider also nuclear norm minimization, i.e., the minimization 
of the l\-norm of singular values, as a possible way of accessing X from m<j> random 
measurements, as in the work \15\ \2b\ \28\j . However, presently no estimates of the 
type (|29p are available in this context, hence we postpone an analysis based on these 
methods fully tailored to matrices to further research. 

Next we need the equivalent of Lemma f3.4l to relate the error between the subspaces 
defined by the largest right singular values of X and X respectively to the error \\X — 
X\\f- We will develop the necessary tools in the following subsection. 

4.2.1 Stability of the singular value decomposition 

Given two matrices B and B with corresponding singular value decompositions 



>-i« ^Ho 1 



and 

B=(Ui U 2 



Ei W V{ 
E 2 A % T 



T 



where it is understood that two corresponding submatrices, e.g., Ui,U\, have the same 
size, we would like to bound the difference between V\ and V\ by the error ||B — B\\f- 
As a consequence of Wedin's perturbation bound [54] . see also [32j Section 7], we have 
the following useful result. 

Theorem 4.3 (Stability of subspaces - Wedin's bound). If there is an a > such that 

min |<t/(Ei) - cr^(S 2 )| > a, (51) 

t,i 

and 

min |ct|(Ei)| > a, (52) 

I 
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then 



WViV? - ViVx\\f < *\\B-B\\ F . (53) 
a 



The conditions (|51|) and (|52|) are separation conditions. The first says that the 
singular values of Si are separated from those of £2. Actually, strictly speaking the 
separation is between Si and £2. However, if \\B — B\\p is sufficiently small compared 
to a, then Weyl's inequality [35] 



\a £ (B) - a e (B)\ < \\B - B\\ F , 

guarantees that the two separations are essentially equivalent. The second condition 
says that the singular values of Si or Si have to be far away from 0. 
Applied to our situation, where X has rank k and thus S2 = 0, we get 

and further since a^{X T ) > a^{X T ) — \\X — X\\p, that 

\\V l V--V 1 V-\\ F < { lf^ V l- ■ (55) 
(JkiX 1 ) - s /m^v 2 

As final ingredient we need to estimate the /c-th singular value of X. The next subsec- 
tion will provide us with a generalization of Hoeffding's inequality, that can be used 
to show that with high probability on the random draw of the sampling points £j the 
/c-th singular value of X is separated from zero. 

4.2.2 Spectral estimates and sums of random semidefinite matrices 

The following theorem generalizes Hoeffding's inequality to sums of random semidefi- 
nite matrices and was recently proved by Tropp in [331 Corollary 5.2 and Remark 5.3], 
improving over results in [1], and using techniques from [29] and [25j . 

Theorem 4.4 (Matrix Chernoff). Consider Xi, . . . ,X m independent random, positive- 
semidefinite matrices of dimension k x k. Moreover suppose 

o-i{Xj) < C, (56) 

almost surely. Compute the singular values of the sum of the expectations 



= o~i ( EJ¥j J and /i min = a k | ^ EXj J , (57) 




then 




Mmax(l + s) 

(I _|_ s )\ c 

}<k[ ± , (58) 
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for all s > (e — 1), and 



.2 



0~k 



Y2 X i) ~ ^ min - ~ s ^min \ < he , (59) 



for all s G (0, 1). 



Applied to the matrix X T the above theorem leads to the following estimate of the 
singular values of X T . 

Lemma 4.5. For any s G (0, 1) we have that 

o- k (X T ) > vWa(l - s) (60) 



with probability 1 — he 2fcC 2 . 

Proof. The proof is based on an application of Theorem 14.41 First of all note that 

X T = gA = U g Tg[V g r A], 

hence T x t = Tg. Moreover 



= \J o-i(G T G), for alH = 1, . . . , k. 

Thus, to get information about the singular values of X T it is sufficient to study that 
of 



Q T G = Y J Vg{Ai 3 )Vg(A^ 

i=i 



We further notice that 



<Ti(Vg(Mj)Vg(Mj) T ) < £ \Vg(Mj)tVg(A$3)e\ 2 < ■= C. 

\e,e'=i J 

Hence Xj = V g(A£j)\7 g(A^j) T is a random positive-semidefinite matrix, that is almost 
surely bounded. Moreover 

EXj = E^g{A^)Vg(Mjf = [ V g{Ax)V g{Ax) T d^ d .,{x) = H g . 

Hence, remembering that the singular values of H g are equivalent to that of H*, by 
condition ([TO]) we have /i max = rnxo~i(H g ) < m^kC^ and [i m \ n = mx&^Hg) > mxoi > 
0. In particular 

mxk 2 C 2 > /u max > /i min > m x a > 0. 
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By an application of Theorem 14.41 we conclude that 
a k (X T ) = o k {Q) 



<Tk ^J2Vg(A^)Vg(Mj) T ] > ^ - 7) > ^m x a{l - s), 



with probability 



Mmin^ 2 ~m x as 2 



1 - ke 2fcC 2 > 1 - ke 2feC 2 , 
for all s e (0,1). □ 

Finally we have collected all the results necessary to prove Theorem 14.11 
Proof of Theorem I4.lt 

Proof. Combining Corollary 14. 2\ Theorem 14.31 and Lemma 14.51 shows that with prob- 
ability at least 



1 _ e -cim* + e -V^d + ke 



for the first k right singular vectors of X and X we have 

\\V{V? - ViVf \\ F < 



vMl - s)-v 2 

Recalling from the proof of Lemma 14.51 that the (first k) right singular vectors V± of 
X T have the form = Vg A then shows that A as defined in Algorithm 2 satisfies 

\T a XT au II aT-it-xtT a fr f/Tii Ht^t/T frfrTii ^ ^^2 



{A 1 A- A 1 A\\ F = WA 1 VgVg 1 A - V X V{ \\ F = \V X V{ - V X V{ \\ F < 



y/a(l -s)-u 2 



Using this estimate we can prove that / as defined in Algorithm 2 is a good approxi- 
mation to /. Since A is row-orthogonal we have A = AA T A and therefore 

\f(x)-f(x)\ = \g(Ax)-g(Ax)\ 

= \g(Ax) - g(AA T Ax)\ 

< C 2 Vk\\Ax - AA T Ax\\ ek 

= C 2 Vk\\A(A T A - A T A)x\\ £ k 

< C 2 Vk\\(A T A- A T A)\\ F \\x\\ e * 

< 2C 2 Vk(l + e) =^= . 

V«(l -s)-u 2 

□ 
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Remark 6. (i) Note that Theorem \4-l\ is again an a priori estimate of the success 
probability and approximation error of Algorithm 2. Once Algorithm 2 has been run we 
have the following a posteriori estimate. With probability at least 1 — (e~ c i m * + e~ v/m * rf ) 
we have that 



11/ -/Hoc <2C7 2V / ^(l + e) 



o- k (X T ) 

(ii) We further observe that Theorem does not straightforwardly reduce to Theo- 
rem \'J.l\ for k = 1, because in the one- dimensional case we used the simpler maximum 
strategy as in ([22]) instead of the singular value decomposition (pET|) . 

4.3 Discussion on tractability 

Recall, that the push-forward measure fik = 7r fc/2r[(d-fc)/2) ^ ~~ H^llf*) ~ & k of f^d-i 
on the unit ball B-^k was determined in Theorem 13.71 as the measure, for which 

H g = [ Vg{Ax)Vg{Ax) T diJL^{x) 

= Jifnffl / V g (y)V g {y) T {l-\\y\\\) d -=^dy. 
7r fe / 2 r((d - k)/2) J B%k £ 2 

As an instructive example, let us apply this formula to the case when g is a radial 
function, i.e., 

g(y) = go(\\y\Uk), 

for a function go : [0, 1] — > R sufficiently smooth, and g (0) = 0. 

A direct calculation shows, that Vg(y) = ■ y, where r = ||y||^fe, and 

Vg(y)Vg(yf = 9 -^yy T . 

Hence, 

r(d/2) f 5o(l|y||^) 2 i^xi™^ 

™ y = ^r ( (d - *)/2) 7l~ !l,!lj ~ M *> 2 *" 

If i / j, the integral vanishes due to the symmetry of B^k. If i = j, we get again by 
symmetry 



T(d/2) r 9 {\\y\\i* ) 2 

7r^T((d-k)/2)J BRk \\y\\% 



( H g)a = *^7zFtT77^ I "" n ""' 2 " S/ yf(l-\\y\\%) d ' "dy 



'2 



r(d/2) 



-2-fe 



2F(d/2) fl g' (r)\l - r 2 ) d -^r k ~ l dr =: a(k,d). 



kT({d- k)/2)T{k/2) Jo 



27 



Hence, H g = a(k,d)Ik- Similarly to Proposition 13.81 we can expand g' into a Taylor 
series 

1=2 ^ >' 

If we assume that g$\o) = 0, for all £ = 1, ... , M, but <7q M+1 ^(0) / 0, then we obtain 
and, by Stirling's approximation, 

= of r ^ 2 > ) 

\T{d/2 + M)J 
= O (d~ M ) , d -> oo. 

From these computations, we deduce that learning functions f(x) = g(Ax), where g 
is radial (or nearly radial), using our method has usually polynomial complexity with 
respect to the dimension d. 



5 Extensions and Generalizations 

We assumed throughout the paper that the function / is defined on the unit ball B R d 
of To be able to approximate the derivatives of / even on the boundary of B^d, 
we actually supposed, that / is defined also on an e neighborhood of the unit ball. 
Furthermore, we assumed that the function values may be measured exactly without 
any error. The main aim of this section is to discuss the possibilities and limitations of 
our method. Firstly, we discuss the numerical stability of our approach with respect to 
noise. Secondly, we deal with functions defined on a convex body C R d . As it is our 
intention here only to sketch, still rigorously, further interesting research directions, we 
limit our discussion to the case of k = 1. 



5.1 Stability under noisy measurements 

Let us assume that the function evaluation in (|14|) can be performed only with certain 
precision. We again collect the rax x instances of (|14p as 

W 

§X = Y -S + — , (61) 

where the (i, j) entry of W (denoted by Wij) is the difference between the exact value 
of f(£j + eipi) — f(£j) and its value measured with noise. This leads to a compressed 
sensing setting 

W 

Y = <$>X + 8 . (62) 
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Applying Theorem 13.21 we obtain a substitute for Corollary 13.31 with £ replaced by 
£ — W/e. Therefore we would like to estimate the norm of Wj (the j-th column of 
W) in £™* and If we merely assume that the noise is bounded (i.e. \wij\ < u), 

the best possible estimate is — ^V 771 *- We observe that the more sampling 

points we take the greater is the level of noise. This effect of amplification of the noise 
is actually known under the name of noise folding [2] and, unfortunately, corrupts the 
estimate (|3"Tj) . see also [ITJ Section 4] for a discussion in a related context. 

Let us therefore sketch a different approach. We make the rather natural assump- 
tion that Wij is a random noise. 

The analogue of Theorem 13.21 for the recovery of x from noisy measurements 
y = §x + u), where u = (wi, . . . , w m ) are independent identically distributed (i.i.d.) 
Gaussian variables with mean zero and variance a 2 , was given in the work of Candes 
and Tao [9]. They proposed a certain ^i-regularization problem, whose solution (called 
the Dantzig selector) satisfies 



that this estimate scales very favorably with d (only as ydog d) and, moreover, does 
depend only on the sparsity of x, and not anymore on the number of measurements 
m$. Therefore, there is no noise folding in this case. 

The equation (|62p requires a combination of Theorem 13.21 and the result of Candes 
and Tao. Namely, we would like to reconstruct x if y = &x + e + ui is given, where e 
is a deterministic error and w is a vector of i.i.d. Gaussian variables. Obviously, the 
detailed analysis of this issue goes beyond the scope of this paper. Nevertheless, let 
us present some numerical evidence of the numerical stability of our approach in the 
presence of random noise. 

We consider the function 



in dimension d = 1000. We use a variant of Algorithm 1 based on l\ minimization to 
identify the active coordinates of /, cf. [31] for details. We suppose that function evalu- 
ations were distorted by Gaussian error vu with oj « A/"(0, 1) and v G {0.1, 0.01, 0.001}. 
We chose e = 0.1 in the approximation ()14|) . For each number of points mx £ = 
1, . . . , 10} (x-axis) and each number of directions m$ G {20^, £ = 1, . . . , 10} (y-axis) 
we produced one hundred trials. The success rates of recovery go from white color (no 
success) to black (100 successful recoveries). 




Especially, if x is a /c-sparse vector, then \\x — x\\gd < C-\/2logd-y/k + 1-0". We observe 




1000 



(63) 
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Figure 2: Recovery of active coordinates of f{x) given by (|63|) with v = 0.1, v = 0.01 
and v = 0.001, from left to right respectively. Let us mention that the success rates 
of recovery for noise-free setting are hardly distinguishable from the last picture above 
(y = 0.001). 

We conclude from Figure 2 that there is a smooth increase of the rate of successful 
recovery with decreasing noise power and a fully stable recovery behavior. 



5.2 Convex bodies 

A careful inspection of our method shows, that it may be generalized to arbitrary 
convex bodies. Let us describe the necessary modifications and give an overview of the 
results for the case k = 1. First of all, one has to replace Q by 

H-f := [ X7f(x)X7f(x) T dfi n (x). (64) 
Jn 

Here, /xq is a probability measure on £1 and the points in X (cf. (|15|) ) are selected at 
random with respect to For Q = B R d, we simply selected fin = fi$d-i to be the 
normalized surface measure on S^ -1 . This corresponded to the fact, that a G S d_1 was 
arbitrary and therefore a-priori no direction was preferred. To be able to evaluate the 
derivatives of / even on the boundary of we suppose, that / is actually defined on 
an e neighborhood of O, namely on the set O + e := {x G M. d : dist(f2,x) < e}. The 
function g is supposed to be defined on the image of f2 + e under the mapping x — > a-x, 
i.e., on an interval. We assume again ([9|). 

Surprisingly enough, these are all the modifications necessary to proceed with the iden- 
tification of a and ()38() holds true under these circumstances. 

The proof of Theorem 13.11 was based on the fact, that for every y G -Br, we can 
easily find an element x y G B R d, such that a-x y = y. It is enough to consider x y = a T y. 
In the case of a general convex set Q, we first need to define for any a G S rf_1 fixed, a 
function x. : a(ft + e) — > f2 + e given by y i— > x y , and such that 

a ■ x y = y. 
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In particular, for all y G a(£l + e) we need to find 

x y £ fl + ef]{x eR d : a ■ x = y}. 

Since both + e and the solution space {x € M. d : a • x = y} are closed convex sets 
in M. d , one could use an alternating projection algorithm for finding x y [3]. Thus, we 
can assume that, at least algorithmically, this map can be computed. Moreover, and 
alternatively, since the operation described above, i.e., finding x y £ B-^d, such that 
o • x y = y, has to be executed as many times as we need to define, e.g., an appropriate 
spline approximation of g, we may proceed as follows: we find first x max ,x m i n £ B R d, 
such that a ■ x max = max xe s Rd a ■ x and a ■ x m ; n = mm. x& B Rd & • x. Then any other x y 
such that y = a ■ x y is computed very fctst by Xy — A<yX m in H~ (1 — ^y)^max for some 
A,G[0,1]. 

With this modification, also Theorem 13.11 holds true, with the definition of g given 
in Algorithm 1 replaced now by 

g{y) ■= f(x y ), yea(n + e) 

and (fM|) replaced by 

11/ - /Hoc < 2C 2 (diam(0) + 26) j=== ■ 

V"! 1 -s)-v\ 

Unfortunately, and this seems to be the main drawback of this approach, the diameter 
of S7, diam(f2) = max x iX ' 6 n 11^ — x '\\e^ m& Y grow with d. This is especially the case, 
when Q = [— 1, l] d , which gives diam(J7) = \f 7 ld. 

5.3 An approach through Minkowski functional 

To get better results for specific convex bodies (i.e. Q = [—1, l] d ), we propose another 
approach. We stress very clearly that up to now this is only to be understood as an 
open direction, which is a subject of further research. 

We assume, that f2 is a closed convex set, which is absorbing and balanced, i.e. 

• for every x € there is a t = t(x) > 0, such that tx 6 Q, 

• a£l := {ax : x £ £1} C f2 for every a G [—1, 1]. 
Then we can define its Minkowski functional as 

pn(x) := inf{r > : x/r £ Q}, x G R d . 
It is well known, that this expression is actually a norm and is its unit ball. Hence 

sup pn(x — x')<2. (65) 

x,x'aQ. 

This allows us to replace the inequality 

\{a — a) ■ [x y — x)\ < \\a — a, 1 1 2 • \\x y — 1 1 2 
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by 

| (a — a) ■ (x y - x)\ < \\a - a\\' u ■ \\x y - x\\u- 

Here, || • ||n = pn(-) and || • ||q is its dual norm. According to ([65]) . this solves the 
problem of the factor diam(fi) - the diameter of Q with respect to || • ||n is always 
bounded by 2. Unfortunately the problem is transferred to the second factor, namely 
|| a — a||^- For this, one would need the analogue of Theorem 13.21 with the £2' n0Tm i n 
(|29p replaced by || • ||^. While any treatment of this general case is clearly beyond the 
scope of this paper and remains a subject of further investigation, we can shortly sketch 
what happens in the special case Q = [—1, l] d . Then we simply have || • \\q = \\ ■ ||^d 
and || • ||^ = || • ll^d. To estimate ||a — d||^d we would have to combine Lemma 3.1 in 
[llj with (|38p and would get again a result that does not depend on the dimension d. 
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